Researchers are incorporating many modern and significant features into advanced driver assistance systems (ADAS). Lane marking detection is one of them; it allows the vehicle to stay in its respective road lane. Conventionally, lane markings are detected with handcrafted, highly specialized features followed by substantial post-processing, which leads to high computational cost and lower accuracy. Additionally, the conventional approach is vulnerable to environmental conditions, which makes it unreliable. Consequently, this research work presents a deep learning-based model that is suitable for diverse environmental conditions, including multiple lanes, different times of day, different traffic conditions, and good and medium weather conditions. The approach is derived from the plain encode-decode E-Net architecture and has been trained using discriminative and cross-entropy losses for backpropagation. The model has been trained and tested using 3,600 training and 2,700 testing images from TuSimple, a robust public dataset. Input images from very diverse environmental conditions have ensured better generalization of the model. The framework has reached a maximum accuracy of 96.61%, with an F1 score of 96.34%, a precision of 98.91%, and a recall of 93.89%. In addition, the model has shown small false-positive and false-negative rates of 3.125% and 1.259%, respectively, which beats the performance of most existing state-of-the-art models.


1. INTRODUCTION


Automobile collisions are considered one of the leading causes of physical injury and sudden death on the road. According to the World Health Organization (WHO), about 1.2 million people have died, and a significant portion of 100 million people worldwide have been severely injured in sudden road crashes, resulting in substantial economic losses for nations. Worldwide, the number of traffic crash deaths is rising day after day. A summary of road accident deaths in Asian countries is depicted in Figure 1. According to the figures for 2013 and 2016, Thailand ranked first in traffic accident rate among the Asian countries, with about 7,152 deaths [1].

Intelligent inspection and vision inspection for autonomous driving are among the growing research domains in the field of artificial intelligence (AI). The recent revolution in computer vision technology has strengthened and enriched the application of AI in intelligent inspection, resulting in major advances in self-driving vehicles that can spare thousands of people from death [2]. Autonomous driving research has been the center of attention because it effectively reduces the risk of road accidents [3]. In particular, advanced driver-assistance systems (ADAS) have become a very popular and growing research domain targeting perception, protection, and environmental awareness around vehicles [4].

Figure 1. Death records in various Asian countries due to road accidents in 2013 and 2016


Lately, there has been a huge advancement in autonomous vehicle technology, including semi-autonomous driving, vehicle-to-vehicle communication, self-parking, lane marking detection, vehicle-to-infrastructure communication, and lane-departure detection [5]. The US National Highway Traffic Safety Administration has declared lane marking detection a primary requirement for any autonomous vehicle [6]. Besides, lane mark identification is considered the most significant innovation in road condition analysis by autonomous vehicles [7]. In addition, knowing the lane position helps to avoid accidents that occur during lane changes [8]. Lane marking detection is useful not only for improving lane keeping but also for enforcing traffic regulations based on the road's lane markings [9].

Even though a lot of research has been done on lane mark detection, there are still many unsolved challenges, especially for autonomous driving in very diverse environmental conditions [10]. Factors like occlusion, rain, shadow, fog, and sunlight create many illusions and result in a challenging and largely unknown environment that is tough for autonomous vehicles to handle [11]. Diverse computer vision technologies like [12] and [13] are being implemented, targeting various problems. Existing methods mostly use handcrafted and highly specialized features that suffer from computational complexity and struggle to tackle diverse environmental conditions [11].


2. RELATED WORK 

Though lane marker detection is a key study issue for self-driving automobiles, it can be difficult and time-consuming under a variety of situations and impacts [14]. As a result, a lot of research is being conducted to develop more precise, accurate, and reliable lane mark detection technology [15]. Human error causes thousands of innocent deaths, including pedestrians and other drivers, during driving and especially during lane changes. To solve this problem, computer vision techniques like the Hough transform [16], edge detection [17], and template matching [18] are being implemented based on low-level features including color and texture. However, none of these techniques is perfect due to the constraints of light, shade, clouds, weather, and environmental changes [19]. In many cases, texture features are used in lane mark detection, and computer vision technologies like [20] and [21] are frequently used with machine learning classifiers, including AdaBoost [22] and support vector machines (SVM) [23], which sometimes produce inefficient output. On the other hand, some viable methods, like [24], are too slow for real-life applications, and others [25] suffer from a lack of accuracy. To overcome both of these limitations, researchers are using deep learning models to develop fast and accurate solutions. Levi et al. [25] used a convolutional neural network (CNN), called StixelNet, to solve the detection problem by splitting the images into distinct columns and improving the result with a smoothness constraint between neighbors. However, it was not suitable for all cases. The affinity between the segmentation and the recognized boundary is insignificant, and the method has weaker numerical results than other methods [25]. Mamun et al. demonstrated a Seg-Net-based model for lane marker detection, though it has an overfitting problem and only focuses on lane space [26]. He et al. developed a CNN-based solution to alleviate the issues under various illumination effects, but the divergence of the road markings influenced the result of lane marking detection [27]. Huang combined spatial and temporal data in a CNN framework to detect the lane markings by selecting the lane boundaries [28].

Nevertheless, this approach chiefly compensates for low illumination conditions, such as night-time and rain. Li aims to develop a more sophisticated method for detecting low-level lanes in the traffic scene by applying a novel hierarchical neural network framework that fuses a CNN and a recurrent neural network (RNN). However, it depends heavily on the target, and it collapses if it misses the target detection [29]. Also, the whole process is time-consuming because it does not use any preprocessing for the algorithm [30]. Nguyen et al. have developed a fully convolutional network (FCN) based model for detecting multi-lane boundary features and hence determining multiple lanes on the road. However, this model still has a somewhat high false-positive rate, and it fails to detect the lane if an identical obstacle is positioned near the lane [10]. Aiming at efficient real-time performance, an agnostic lane detection process has been proposed based on the E-Net of Paszke et al. [31] and the layout of Neven et al. [32]. Though it improves real-time applicability, it is still unable to provide high performance in distinct traffic scene conditions [11]. An embedding loss-driven generative adversarial network (GAN) model has been introduced to avoid the computational cost and complexity of pixel-wise models. However, its successful detection rate varies significantly because the model can only work in fixed scenarios.

The presented model is based on a basic encode-decode E-Net architecture that was trained with discriminative and cross-entropy losses. The model has been trained and tested with a robust publicly available dataset, TuSimple, which contains more than 3,600 training images and 2,700 testing images covering very diverse weather and environmental conditions, including variation in daylight, shade, and rain.


3. RESEARCH METHODOLOGY 

The proposed deep learning approach provides a novel technique for lane mark determination for automated vehicles. It is a simple deep learning model derived from the basic encoder-decoder model inspired by E-Net [31]. A customized ENet architecture has been trained on the processed TuSimple data with a combination of discriminative and cross-entropy losses. The complete layout of the proposed model is shown in Figure 2.


[Figure 2 diagram blocks: Input Dataset, Data Processing, Segmented Images, Predicted Instance Segmentation, Interfacing, Final Output]

Figure 2. Complete workflow diagram of the proposed approach 


3.1. Input dataset and preprocessing 
The TuSimple dataset has been used to develop the model because it has adequate images with accurate annotation. The training and testing images are of good quality, which supports the standard of the model. Images of this dataset are annotated with the full lane boundary instead of just the lane mark. Each image has a dimension of 720x1280 px. The dataset comes with 3 JSON files that give the clip paths of 3,626 image frames, the lane positions, and the lane heights as lists. After the lane features are extracted, a line is drawn to fit every lane's data points. Here, pixels are assigned a binary value according to whether or not they belong to a lane, and each lane is additionally given its own label, to create the binary and instance label images. Eventually, the image frames were resized to 224x224 px to maintain a uniform input dimension and to reduce the computational cost. After all of the above preprocessing steps, the output data is the original image with an instance label and a binary label.
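
The label-generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the TuSimple annotation layout in which each record carries a `lanes` list of x coordinates (with -2 marking a missing point) and a shared `h_samples` list of y coordinates. The helper names `make_label_masks` and `resize_nearest`, the line thickness, and the nearest-neighbour resize are hypothetical choices.

```python
import numpy as np

def make_label_masks(lanes, h_samples, height=720, width=1280, thickness=5):
    """Rasterize lane annotations into a binary mask and an instance mask."""
    binary = np.zeros((height, width), dtype=np.uint8)
    instance = np.zeros((height, width), dtype=np.uint8)
    half = thickness // 2
    lane_id = 0
    for xs in lanes:
        # Keep only annotated points; -2 marks "no lane at this height".
        pts = [(x, y) for x, y in zip(xs, h_samples) if x >= 0]
        if len(pts) < 2:
            continue
        lane_id += 1
        for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
            # Linearly interpolate between consecutive annotated points.
            n = max(abs(x1 - x0), abs(y1 - y0)) + 1
            cols = np.linspace(x0, x1, n).round().astype(int)
            rows = np.clip(np.linspace(y0, y1, n).round().astype(int), 0, height - 1)
            for dx in range(-half, half + 1):
                c = np.clip(cols + dx, 0, width - 1)
                binary[rows, c] = 1          # lane vs. background
                instance[rows, c] = lane_id  # one id per lane
    return binary, instance

def resize_nearest(mask, size=(224, 224)):
    """Nearest-neighbour resize so that label values are never blended."""
    rows = np.arange(size[0]) * mask.shape[0] // size[0]
    cols = np.arange(size[1]) * mask.shape[1] // size[1]
    return mask[rows][:, cols]

# Toy annotation (made-up values): two lanes sampled at four heights.
lanes = [[-2, 500, 450, 400], [-2, 700, 760, 820]]
h_samples = [240, 400, 560, 710]
binary, instance = make_label_masks(lanes, h_samples)
binary_small = resize_nearest(binary)
instance_small = resize_nearest(instance)
```

Nearest-neighbour sampling is used for the labels because interpolating between lane ids would create values that correspond to no lane.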


3.2. E-Net architecture 

The original ENet architecture is an encode-decode network with three stages in the encoding section and two stages in the decoding section. However, the decoding section only upsamples the output obtained from the encoding stages. Passing all the data from the encoding stages to the decoding stages degrades the result, because it carries irrelevant information from the input that is not related to the lanes. Therefore, the original ENet architecture has been customized by dividing the encoding section into binary and instance segments. Consequently, each unit carries one particular kind of lane information and performs its own task. Binary segmentation provides information about which pixels belong to the lanes, whereas instance segmentation ensures the proper pixel position of each lane on the images. The layout of the customized E-Net architecture is shown in Figure 3. Several small bottlenecks have been formed before the main encode-decode sections, namely the normal bottleneck (NB), encode bottleneck (EB), and decode bottleneck (DB). They reduce the number of features in every layer, which reduces computational complexity, helps the network learn the relevant features more deeply, and helps it find the best possible training loss.

Figure 3. The architecture of the customized E-Net model


3.2.1. Normal bottleneck (NB) 

In NB, a 1x1 convolution has been applied to decrease the number of channels. The 1x1 convolutional layer reduces the computational complexity and makes the embedding process in the pooling section more efficient [33]. A dilated convolution with a 3x3 kernel has been used to keep the dimension of the input data constant and the resolution high. An asymmetric convolution with a 5x1 kernel has been used to reduce the probability of overfitting and the computational complexity, as it convolves the separate input channels with different filter channels. A regular convolution with a 3x3 kernel has been used to continue the convolution with the same dimension. Besides, a 1x1 convolution has been applied to restore the number of channels to its initial value. Furthermore, a dropout layer has been used as regularization to reduce overfitting.
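
The saving from the 1x1 reduction and the asymmetric kernel can be checked by counting weights. The sketch below is only illustrative arithmetic: the channel count of 128, the reduction factor of 4, and the pairing of the 5x1 kernel with a 1x5 kernel are assumptions and are not stated in the text.

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Number of weights in a 2D convolution (bias terms ignored)."""
    return c_in * c_out * k_h * k_w

channels = 128            # assumed number of input/output channels
reduced = channels // 4   # assumed reduction inside the bottleneck

# Plain 3x3 convolution applied directly on all channels.
plain = conv_params(channels, channels, 3, 3)

# Bottleneck: 1x1 reduce, 3x3 on the reduced channels, 1x1 expand.
bottleneck = (conv_params(channels, reduced, 1, 1)
              + conv_params(reduced, reduced, 3, 3)
              + conv_params(reduced, channels, 1, 1))

# Full 5x5 kernel versus an asymmetric 5x1 followed by 1x5 pair.
full_5x5 = conv_params(reduced, reduced, 5, 5)
asymmetric = (conv_params(reduced, reduced, 5, 1)
              + conv_params(reduced, reduced, 1, 5))

print(plain, bottleneck)     # 147456 17408
print(full_5x5, asymmetric)  # 25600 10240
```

Under these assumptions the bottleneck uses roughly one eighth of the weights of the plain 3x3 layer, and the asymmetric pair uses 40% of the weights of the full 5x5 kernel.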


3.2.2. Encode bottleneck (EB) 

In the EB, max pooling has been executed with a 2x2 kernel and a 2x2 stride. Besides, a 2D convolution with the same kernel and stride has been applied to reduce the number of channels. Also, a 1x1 convolution and a dropout layer have been used, as in NB.


3.2.3. Decode bottleneck (DB)

Max unpooling has been executed with the same resolution as in EB. A transposed convolutional layer with a 3x3 kernel and a 2x2 stride has been used to upsample the encoded features. Also, a 1x1 convolution and a dropout layer have been used, as in NB.
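
The effect of the EB and DB on the spatial size can be traced with the standard output-size formulas. This is a small sketch: the kernel and stride values come from the text, while `padding=1` and `output_padding=1` in the transposed convolution are assumptions chosen so that the upsampling exactly undoes the pooling.

```python
def pool_out(size, kernel=2, stride=2):
    """Output length of max pooling without padding."""
    return (size - kernel) // stride + 1

def transposed_conv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed convolution (dilation of 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 224
down = pool_out(size)            # 112 after the 2x2, stride-2 pooling
up = transposed_conv_out(down)   # 224 after the 3x3, stride-2 upsampling

# Without the assumed padding terms the size would overshoot by one pixel.
up_no_pad = transposed_conv_out(down, padding=0, output_padding=0)  # 225
```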


3.2.4. Encoder stage 

The encoder initially receives the original, binary label, and instance label images from the data processing section. It separates the image information into two segments, binary and instance segmentation, and extracts the lane features and instances individually. Two bottlenecks have been used in the encoder stage to split the input data between binary segmentation and instance segmentation. There are one EB and four NB in the first bottleneck. In contrast, one EB and seven NB with three dilated and two asymmetric convolution layers have been applied in the second bottleneck. Besides, eight NB with four dilated and two asymmetric convolutions have been used in the bottleneck for binary segmentation and in the bottleneck for instance segmentation.


3.2.5. Decode stage 

The information obtained from the encoding stages needs to be decoded to obtain the final result from the network. One up-sampling layer and two NB have been applied in the bottleneck for the binary segmentation part and for the instance segmentation part, respectively, to upsample the lane information. Again, one up-sampling layer and one NB have been used for the same purpose. Eventually, after a transposed convolutional layer in the last bottleneck, the network provides the predicted output as binary segmentation and instance segmentation images.


3.2.6. Loss measurement 

As in other deep learning models, the loss has been computed and backpropagated, and the model weights have been updated accordingly to optimize the model. Two types of losses have been used for the two segmented images: cross-entropy loss and discriminative loss. The binary segmented images store the data as 0 and 1. Here, the cross-entropy loss has been computed using (1).


$$Loss_{CE} = -\left(y\log(p) + (1 - y)\log(1 - p)\right) \qquad (1)$$
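
Equation (1) can be evaluated numerically as in the following sketch, where y is taken to be the binary label of a pixel and p the predicted lane probability. Averaging over all pixels and clipping p away from 0 and 1 are assumptions made here for numerical convenience; the text does not specify either.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all pixels."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    loss = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(loss.mean())

# Toy 2x2 example: labels and predicted probabilities (made-up values).
y = np.array([[0.0, 1.0], [1.0, 0.0]])
p = np.array([[0.1, 0.9], [0.8, 0.3]])
bce = binary_cross_entropy(y, p)
```

Confident correct predictions (p close to y) contribute a loss near zero, while confident wrong predictions are penalized heavily.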


Since instance segmentation finds the exact location of each lane, the discriminative loss has been computed in this stage. The model has been designed so that pixels with the same label take nearby positions and pixels with different labels keep their distance. It thus puts pixels of the same lane in the same cluster, while pixels from different lanes go into their respective different clusters. The whole process has been modeled in three parts: separation, neighborhood, and regularization. The separation part extends the distance between two lane clusters based on a threshold value. On the other hand, the neighborhood part reduces the distance between pixels in a particular lane cluster based on a threshold value $\delta_{neigh}$. Additionally, the regularization part keeps every cluster close to the origin. Finally, the discriminative loss is computed using the following formula.


$$Disc_{loss} = Loss_{sep} + Loss_{neigh} + Loss_{Regu}$$

Here, $N_c$ represents the number of lane clusters, $N_e$ represents the number of elements in the lane cluster, $M$ represents the mean value of the instances in the cluster, and $x_i$ represents the instances. The total loss of the network has been computed by adding the cross-entropy loss and the discriminative loss. Backpropagation has updated the weights of the neural network model by operating on this total loss.
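
The three terms described above can be sketched in NumPy as follows. This follows the commonly used hinge-style form of the discriminative loss and is an assumption about the exact expressions, which are not fully given here; the threshold values `delta_neigh` and `delta_sep` and the function name are hypothetical.

```python
import numpy as np

def discriminative_loss(embeddings, labels, delta_neigh=0.5, delta_sep=3.0):
    """embeddings: (N, D) pixel embeddings; labels: (N,) lane ids, 0 = background."""
    lane_ids = [c for c in np.unique(labels) if c != 0]
    if not lane_ids:
        return 0.0
    # Mean embedding M of every lane cluster.
    means = np.stack([embeddings[labels == c].mean(axis=0) for c in lane_ids])

    # Neighborhood: pull each pixel to within delta_neigh of its cluster mean.
    neigh = 0.0
    for mean, c in zip(means, lane_ids):
        dist = np.linalg.norm(embeddings[labels == c] - mean, axis=1)
        neigh += np.mean(np.maximum(dist - delta_neigh, 0.0) ** 2)
    neigh /= len(lane_ids)

    # Separation: push every pair of cluster means at least 2*delta_sep apart.
    sep = 0.0
    n = len(lane_ids)
    if n > 1:
        for a in range(n):
            for b in range(n):
                if a != b:
                    gap = np.linalg.norm(means[a] - means[b])
                    sep += max(2.0 * delta_sep - gap, 0.0) ** 2
        sep /= n * (n - 1)

    # Regularization: keep the cluster means close to the origin.
    regu = float(np.mean(np.linalg.norm(means, axis=1)))

    return float(neigh + sep + regu)

# Toy embeddings: two tight, well-separated groups (made-up values).
emb = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
good = discriminative_loss(emb, np.array([1, 1, 2, 2]))  # labels match groups
bad = discriminative_loss(emb, np.array([1, 2, 1, 2]))   # labels mixed up
```

With labels that match the groups, only the regularization term is non-zero; with mixed labels, both the neighborhood and separation terms grow, so `bad` is much larger than `good`. The total training loss would then be the sum of this value and the cross-entropy loss.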


3.2.7. Interfacing 

The output of the proposed model is the instance segmentation of the pixels of each lane in the predicted images. The final goal is to superimpose the lane pixels on top of the original input images. Hence, density-based spatial clustering of applications with noise (DBSCAN) has been employed to interface the predicted lane image with the original input image. DBSCAN performs better and more efficiently than common clustering techniques such as K-means, especially for noisy or arbitrarily shaped clusters [34]. If the lanes are positioned close together and are irregular in nature, DBSCAN interfaces the lane pixels better. In this DBSCAN model, a neighborhood distance of 0.05 has been used for pixels of the same lane. A point is considered to be in the same lane if it lies within this threshold distance; otherwise, the point is considered to belong to another cluster. This process iterates until all the points on the lanes have been assigned.
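
The clustering step can be illustrated with a minimal DBSCAN written in NumPy. This is a sketch for explanation only: the distance of 0.05 comes from the text, whereas `min_samples=3`, the toy points, and the choice of coordinate space are assumptions. In practice a library implementation such as scikit-learn's `sklearn.cluster.DBSCAN` would normally be preferred.

```python
import numpy as np

def dbscan(points, eps=0.05, min_samples=3):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # Pairwise Euclidean distances (acceptable for a few thousand points).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point; stays noise unless claimed later
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand through core points only
        cluster += 1
    return labels

# Two chains of points spaced 0.03 apart, plus one isolated point.
lane_a = np.array([[0.00, 0.0], [0.03, 0.0], [0.06, 0.0], [0.09, 0.0]])
lane_b = lane_a + np.array([1.0, 0.0])
points = np.vstack([lane_a, lane_b, [[0.5, 0.5]]])
labels = dbscan(points)
```

Each chain forms one cluster because consecutive points lie within the 0.05 distance, while the isolated point is labeled as noise.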


4. RESULTS AND DISCUSSION 

The preprocessed data has been fed into the model for training, and the model performance has then been tested in terms of accuracy and other relevant metrics. Accuracy alone cannot guarantee the reliability of the model, so relevant parameters such as false-positive rate, false-negative rate, and F1 score are also considered in this type of classification task [35]-[38]. The equations for the performance parameters are given in (2) to (5).


$$\text{Accuracy value} = \frac{\text{Quantity of actual prediction}}{\text{Total sample data}} \qquad (2)$$

$$\text{Precision value} = \frac{\text{actual positive prediction}}{\text{true positive} + \text{false positive}} \qquad (3)$$

$$\text{F1 score value} = 2 \times \frac{\text{Precision value} \times \text{Recall value}}{\text{Precision value} + \text{Recall value}} \qquad (4)$$

$$\text{Recall value} = \frac{\text{actual number of positive prediction}}{\text{Number of true positive} + \text{number of false negative}} \qquad (5)$$
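
Equations (2) to (5) can be computed from confusion-matrix counts as in the following sketch. It assumes that "actual positive prediction" denotes true positives and that "quantity of actual prediction" denotes all correct predictions; the counts in the example are made up for illustration and are not results from this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=30, tn=870)
```

In this example the accuracy is high because true negatives dominate, while the recall is noticeably lower, which is why accuracy alone is not a sufficient measure.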


Our model has been trained for a total of one hundred epochs with images of dimension 224x224 px and a batch size of 16. PReLU activation has been used in the model. The Adam optimizer with a learning rate of 0.0001 has been used to optimize the proposed model, together with the stride and valid-padding settings. The whole model was developed in an Ubuntu-based Linux environment with a GTX 1080 Ti GPU. The key performance parameters of the proposed model are shown in Figure 4. The model has recorded a highest accuracy of 96.61%, with an F1 score of 96.34%, a precision of 98.91%, and a recall of 93.89%. In addition, the model has shown small false-positive and false-negative rates of 3.125% and 1.259%, respectively.

Figure 4. Performance metrics visualization of the proposed model


The proposed approach has also been evaluated at epochs 20, 40, 60, 80, and 100, and the results are listed in Table 1. The performance of the model has increased gradually with the number of epochs, reaching its maximum accuracy at around the 100th epoch.

Table 1. Performance result comparison with different epochs


Epoch   Accuracy (%)   F1 score (%)   Precision (%)   Recall (%)
20      90.1           93.4           94.85           91.2
40      92.32          94.21          95.01           91.47
60      94.35          95.66          95.5            92.39
80      95.45          95.79          97.36           93.2
100     96.61          96.34          98.91           93.45


Loss measurement and optimization are central steps in developing a functional deep learning-based model, as the lowest possible loss indicates the best-optimized architecture. The overall loss of the model is visualized in Figure 5, which shows the gradual decrease of the loss during the training process. The accumulated minimal loss of the model has been noted as 5.39%, which indicates the efficiency of the proposed model. As the loss is reduced in every epoch, the basic features of lane markings are extracted from the input images. This process also helps to minimize the false-positive rate.


[Figure 5 plot: loss (y-axis) versus epoch (x-axis)]


Figure 5. Loss visualization with respect to the number of epochs 


The overall performance of the proposed model has been compared with recent state-of-the-art models in Table 2. The table shows that our approach compares favorably with other recent deep learning-based lane mark recognition models. The proposed model has shown the best performance in terms of accuracy, F1 score, precision, and recall compared to most of the existing relevant research articles. Besides, this work has shown false-positive and false-negative values that are among the lowest reported by the existing work in this domain. As the evaluation results of the proposed method are better than those of most current deep learning techniques, the method is efficient for detecting lane markings.


Table 2. Performance comparison of the proposed method and some existing methods 


Methods                Accuracy (%)  Recall (%)  Precision (%)  False positive (%)  False negative (%)  F1 score (%)
Pizzati et al. [39]    95.24         -           -              9.42                0.033               -
Hoe et al. [11]        96.29         -           -              3.21                4.28                -
Mamidala et al. [40]   96.10         -           -              -                   -                   94.45
Yoo et al. [41]        96.02         -           -              7.22                2.18                -
Tabelini et al. [42]   93.36         -           -              6.17                -                   -
He et al. [27]         -             93.80       95.49          -                   -                   -
Zhe et al. [43]        -             -           94.94          2.79                4.99                -
Tian et al. [19]       -             66.4        83.5           -                   -                   -
Proposed Method        96.61         93.89       98.91          3.125               1.259               96.34


As the whole architecture of the model has been trained through small bottlenecks of convolutional layers with fewer parameters, the computational complexity of the training process has been reduced. Besides, the proposed approach has used asymmetric convolutional layers, which reduce the probability of overfitting and the computational cost. In addition, DBSCAN has been used to interface the predicted segmented images instead of a separate convolution-based network, which reduces computational cost and provides compatibility with both straight and curved lines. The model was trained and tested with the TuSimple dataset, which contains road frames in very different and complex environmental conditions. A few of the input and output images have been shown in Figure 6 to visualize the outcome of the proposed method. The figure shows the original image, the predicted lane mark, the corresponding color image, and the projected lane mark. It illustrates that the proposed model can determine lane markings accurately and precisely. We believe that this research will contribute significantly to lane mark detection research.




Figure 6. The output of the proposed model: original image (1st column), lane prediction in black (2nd column), lane prediction in color (3rd column), lane projection in the original image (4th column)


5. CONCLUSION 

ADAS mainly support autonomous vehicle technology, and lane mark detection is one of the core components of ADAS. In this research work, a simple customized E-Net architecture on an encode-decode basis has been used to find lane marks in very diverse and practical environmental conditions. TuSimple, a robust dataset that includes frames from very diverse environmental conditions, including straight lanes, low light, shadow, and curved lanes, has been used to develop the model and justify its performance. The proposed model has shown better accuracy, F1 score, precision, and recall than most of the other existing models. In addition, the proposed model has less computational complexity than the existing models. Besides, our model showed a minimal loss during the training process. Hence, the proposed architecture is a better model for lane mark detection, outperforming the state-of-the-art technologies in most performance parameters. This model can therefore have a significant positive impact on the AI-based ADAS research area. However, the result may be improved by training the model with a more robust dataset that contains frames from more diverse environmental conditions.